```python
def cached(func_key, override=False):
    """
    A decorator to cache function results.

    Parameters
    ==========
    func_key: str
      File basename used to save pickled results
    override: bool
      When True, re-compute even if the results are already stored
    """
    def compute_and_cache_decorator(compute_function):
        """
        Wrap the caching function.

        Parameters
        ==========
        compute_function: function
          The function to run and cache results
        """
        def compute_and_cache(*args, **kwargs):
            """
            Perform a computation and cache, or load cached result.

            Parameters
            ==========
            args
              Positional arguments for the compute function
            kwargs
              Keyword arguments for the compute function
            """
            # Add an identifier from the particular function call
            if 'cache_key' in kwargs:
                key = '_'.join((func_key, kwargs['cache_key']))
            else:
                key = func_key
            path = os.path.join(
                et.io.HOME, et.io.DATA_NAME, 'jars', f'{key}.pickle')
            # Check if the cache exists already or override caching
            if not os.path.exists(path) or override:
                # Make jars directory if needed
                os.makedirs(os.path.dirname(path), exist_ok=True)
                # Run the compute function as the user did
                result = compute_function(*args, **kwargs)
                # Pickle the object
                with open(path, 'wb') as file:
                    pickle.dump(result, file)
            else:
                # Unpickle the object
                with open(path, 'rb') as file:
                    result = pickle.load(file)
            return result
        return compute_and_cache
    return compute_and_cache_decorator
```

Land cover classification at the Mississippi Delta
In this notebook, you will use a k-means unsupervised clustering algorithm to group pixels by similar spectral signatures. k-means is an exploratory method for finding patterns in data. Because it is unsupervised, you don’t need any training data for the model. You also can’t measure how well it “performs” because the clusters will not correspond to any particular land cover class. However, we expect at least some of the clusters to be identifiable as different types of land cover.
You will use the Harmonized Landsat Sentinel-2 (HLS) multispectral dataset. You can access the data with an Earthdata account and the earthaccess library from NSIDC:
STEP 1: SET UP
- Import all libraries you will need for this analysis
- Configure GDAL parameters to help avoid connection errors:

```python
os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"
```
See our solution!
```python
import os
import pickle
import re
import warnings

import cartopy.crs as ccrs
import earthaccess
import earthpy as et
import geopandas as gpd
import geoviews as gv
import hvplot.pandas
import hvplot.xarray
import numpy as np
import pandas as pd
import rioxarray as rxr
import rioxarray.merge as rxrmerge
from tqdm.notebook import tqdm
import xarray as xr
from shapely.geometry import Polygon
from sklearn.cluster import KMeans

os.environ["GDAL_HTTP_MAX_RETRY"] = "5"
os.environ["GDAL_HTTP_RETRY_DELAY"] = "1"

warnings.simplefilter('ignore')
```

Below you can find code for a caching decorator which you can use in your code. To use the decorator:
```python
@cached(key, override)
def do_something(*args, **kwargs):
    ...
    return item_to_cache
```

This decorator will pickle the results of running the do_something() function, and only run the code if the results don’t already exist. To override the caching, for example temporarily after making changes to your code, set override=True. Note that to use the caching decorator, you must write your own function to perform each task!
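The decorator above depends on the earthpy data directory layout. To see the pattern in isolation, here is a stripped-down sketch that behaves the same way; the names cached_demo and CACHE_DIR are hypothetical, and it caches to a temporary directory instead of et.io.HOME:

```python
import os
import pickle
import tempfile

# Hypothetical cache location standing in for et.io.HOME/et.io.DATA_NAME/'jars'
CACHE_DIR = tempfile.mkdtemp()

def cached_demo(func_key, override=False):
    """Simplified sketch of the caching decorator above."""
    def decorator(compute_function):
        def wrapper(*args, **kwargs):
            path = os.path.join(CACHE_DIR, f'{func_key}.pickle')
            if not os.path.exists(path) or override:
                # Compute and pickle the result
                result = compute_function(*args, **kwargs)
                with open(path, 'wb') as file:
                    pickle.dump(result, file)
            else:
                # Load the previously pickled result
                with open(path, 'rb') as file:
                    result = pickle.load(file)
            return result
        return wrapper
    return decorator

calls = []

@cached_demo('expensive_sum')
def expensive_sum(n):
    calls.append(n)  # track how many times the body actually runs
    return sum(range(n))

first = expensive_sum(10)   # computes and pickles
second = expensive_sum(10)  # loads the pickle; the body does not run again
```

Both calls return the same value, but the function body only runs once, which is the behavior you want for the expensive download and processing steps later in this notebook.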
STEP 2: STUDY SITE
For this analysis, you will use a watershed from the Water Boundary Dataset, HU12 watersheds (WBDHU12.shp).
- Download the Water Boundary Dataset for region 8 (Mississippi)
- Select watershed 080902030506
- Generate a site map of the watershed
Try to use the caching decorator
See our solution!
```python
@cached('wbd_08')
def read_wbd_file(wbd_filename, huc_level, cache_key):
    # Download and unzip
    wbd_url = (
        "https://prd-tnm.s3.amazonaws.com"
        "/StagedProducts/Hydrography/WBD/HU2/Shape/"
        f"{wbd_filename}.zip")
    wbd_dir = et.data.get_data(url=wbd_url)
    # Read desired data
    wbd_path = os.path.join(wbd_dir, 'Shape', f'WBDHU{huc_level}.shp')
    wbd_gdf = gpd.read_file(wbd_path, engine='pyogrio')
    return wbd_gdf

huc_level = 12
wbd_gdf = read_wbd_file(
    "WBD_08_HU2_Shape", huc_level, cache_key=f'hu{huc_level}')

delta_gdf = (
    wbd_gdf[wbd_gdf[f'huc{huc_level}']
    .isin(['080902030506'])]
    .dissolve()
)

(
    delta_gdf.to_crs(ccrs.Mercator())
    .hvplot(
        alpha=.2, fill_color='white',
        tiles='EsriImagery', crs=ccrs.Mercator())
    .opts(width=600, height=300)
)
```

Downloading from https://prd-tnm.s3.amazonaws.com/StagedProducts/Hydrography/WBD/HU2/Shape/WBD_08_HU2_Shape.zip
Extracted output to /home/runner/earth-analytics/data/earthpy-downloads/WBD_08_HU2_Shape
We chose this watershed because it covers parts of New Orleans and is near the Mississippi Delta. Deltas are boundary areas between the land and the ocean, and as a result tend to contain a rich variety of different land cover and land use types.
Write a 2-3 sentence site description (with citations) of this area that helps to put your analysis in context.
YOUR SITE DESCRIPTION HERE
STEP 3: MULTISPECTRAL DATA
Search for data
- Log in to the earthaccess service using your Earthdata credentials:

  ```python
  earthaccess.login(persist=True)
  ```

- Modify the following sample code to search for granules of the HLSL30 product overlapping the watershed boundary from May to October 2023 (there should be 76 granules):

  ```python
  results = earthaccess.search_data(
      short_name="...",
      cloud_hosted=True,
      bounding_box=tuple(gdf.total_bounds),
      temporal=("...", "..."),
  )
  ```
# Log in to earthaccess
# Search for HLS tiles

See our solution!

```python
# Log in to earthaccess
earthaccess.login(persist=True)
# Search for HLS tiles
results = earthaccess.search_data(
    short_name="HLSL30",
    cloud_hosted=True,
    bounding_box=tuple(delta_gdf.total_bounds),
    temporal=("2024-06", "2024-08"),
)
```

Compile information about each granule
I recommend building a GeoDataFrame, as this will allow you to plot the granules you are downloading and make sure they line up with your shapefile. You could also use a DataFrame, dictionary, or a custom object to store this information.
- For each search result:
  - Get the following information (HINT: look at the ['umm'] values for each search result):
    - granule id (UR)
    - datetime
    - geometry (HINT: check out the shapely.geometry.Polygon class to convert points to a Polygon)
  - Open the granule files. I recommend opening one granule at a time, e.g. with earthaccess.open([result]).
  - For each file (band), get the following information:
    - file handler returned from earthaccess.open()
    - tile id
    - band number
- Compile all the information you collected into a GeoDataFrame
# Loop through each granule
# Get granule information
# Get URL
# Build metadata DataFrame rows
# Concatenate metadata DataFrame

See our solution!
```python
def get_earthaccess_links(results):
    url_re = re.compile(
        r'\.(?P<tile_id>\w+)\.\d+T\d+\.v\d\.\d\.(?P<band>[A-Za-z0-9]+)\.tif')
    # Loop through each granule
    link_rows = []
    for granule in tqdm(results):
        # Get granule information
        info_dict = granule['umm']
        granule_id = info_dict['GranuleUR']
        datetime = pd.to_datetime(
            info_dict
            ['TemporalExtent']['RangeDateTime']['BeginningDateTime'])
        points = (
            info_dict
            ['SpatialExtent']['HorizontalSpatialDomain']['Geometry']
            ['GPolygons'][0]['Boundary']['Points'])
        geometry = Polygon(
            [(point['Longitude'], point['Latitude']) for point in points])
        # Get URL
        files = earthaccess.open([granule])
        # Build metadata DataFrame
        for file in files:
            match = url_re.search(file.full_name)
            if match is not None:
                link_rows.append(
                    gpd.GeoDataFrame(
                        dict(
                            datetime=[datetime],
                            tile_id=[match.group('tile_id')],
                            band=[match.group('band')],
                            url=[file],
                            geometry=[geometry]
                        ),
                        crs="EPSG:4326"
                    )
                )
    # Concatenate metadata DataFrame
    file_df = pd.concat(link_rows).reset_index(drop=True)
    return file_df
```

Open, crop, and mask data
This will be the most resource-intensive step. I recommend caching your results using the cached decorator or by writing your own caching code. I also recommend testing this step with one or two dates before running the full computation.
This code should include at least one function including a numpy-style docstring. A good place to start would be a function for opening a single masked raster, applying the appropriate scale parameter, and cropping.
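As a reminder of what a numpy-style docstring looks like, here is a minimal, self-contained example using a hypothetical helper function (not the solution function, which opens rasters over the network):

```python
def scale_reflectance(values, scale=0.0001):
    """
    Apply a scale factor to raw reflectance values.

    Parameters
    ==========
    values: list of int
      Raw digital numbers from a reflectance band
    scale: float
      Multiplicative scale factor, by default 0.0001

    Returns
    =======
    list of float
      Scaled reflectance values
    """
    return [value * scale for value in values]

scaled = scale_reflectance([1234, 5678])
```

Your own function should document its parameters the same way, so that anyone reading your notebook can tell what inputs it expects.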
- For each granule:
  - Open the Fmask band, crop, and compute a quality mask for the granule. You can use the following code as a starting point, making sure that mask_bits contains the quality bits you want to consider:

    ```python
    # Expand into a new dimension of binary bits
    bits = (
        np.unpackbits(da.astype(np.uint8), bitorder='little')
        .reshape(da.shape + (-1,))
    )

    # Select the required bits and check if any are flagged
    mask = np.prod(bits[..., mask_bits] == 0, axis=-1)
    ```

  - For each band that starts with 'B':
    - Open the band, crop, and apply the scale factor
    - Name the DataArray after the band using the .name attribute
    - Apply the cloud mask using the .where() method
    - Store the DataArray in your data structure (e.g. adding a GeoDataFrame column with the DataArray in it). Note that you will need to remove the rows for unused bands.
# Loop through each image
# Open granule cloud cover
# Compute cloud mask
# Loop through each spectral band
# Open, crop, and mask the band
# Add the DataArray to the metadata DataFrame row
# Reassemble the metadata DataFrame

See our solution!
```python
@cached('delta_reflectance_da_df')
def compute_reflectance_da(search_results, boundary_gdf):
    """
    Connect to files over VSI, crop, cloud mask, and wrangle.

    Returns a single reflectance DataFrame
    with all bands as columns and
    centroid coordinates and datetime as the index.

    Parameters
    ==========
    search_results: list
      earthaccess search results with file connections and metadata
      (datetime, tile_id, band, and url)
    boundary_gdf: gpd.GeoDataFrame
      Boundary used to crop the data
    """
    def open_dataarray(url, boundary_proj_gdf, scale=1, masked=True):
        # Open masked DataArray
        da = rxr.open_rasterio(url, masked=masked).squeeze() * scale
        # Reproject boundary if needed
        if boundary_proj_gdf is None:
            boundary_proj_gdf = boundary_gdf.to_crs(da.rio.crs)
        # Crop
        cropped = da.rio.clip_box(*boundary_proj_gdf.total_bounds)
        return cropped

    def compute_quality_mask(da, mask_bits=[1, 2, 3]):
        """Mask out low quality data by bit"""
        # Unpack bits into a new axis
        bits = (
            np.unpackbits(
                da.astype(np.uint8), bitorder='little'
            ).reshape(da.shape + (-1,))
        )
        # Select the required bits and check if any are flagged
        mask = np.prod(bits[..., mask_bits] == 0, axis=-1)
        return mask

    file_df = get_earthaccess_links(search_results)

    granule_da_rows = []
    boundary_proj_gdf = None

    # Loop through each image
    group_iter = file_df.groupby(['datetime', 'tile_id'])
    for (datetime, tile_id), granule_df in tqdm(group_iter):
        print(f'Processing granule {tile_id} {datetime}')
        # Open granule cloud cover
        cloud_mask_url = (
            granule_df.loc[granule_df.band == 'Fmask', 'url']
            .values[0])
        cloud_mask_cropped_da = open_dataarray(
            cloud_mask_url, boundary_proj_gdf, masked=False)
        # Compute cloud mask
        cloud_mask = compute_quality_mask(cloud_mask_cropped_da)
        # Loop through each spectral band
        for i, row in granule_df.iterrows():
            if row.band.startswith('B'):
                # Open, crop, and mask the band
                band_cropped = open_dataarray(
                    row.url, boundary_proj_gdf, scale=0.0001)
                band_cropped.name = row.band
                # Add the DataArray to the metadata DataFrame row
                row['da'] = band_cropped.where(cloud_mask)
                granule_da_rows.append(row.to_frame().T)
    # Reassemble the metadata DataFrame
    return pd.concat(granule_da_rows)

reflectance_da_df = compute_reflectance_da(results, delta_gdf)
```

Processing granule T15RYN 2024-06-07 16:31:11.509000+00:00
Processing granule T15RYP 2024-06-07 16:31:11.509000+00:00
Processing granule T16RBT 2024-06-07 16:31:11.509000+00:00
Processing granule T16RBU 2024-06-07 16:31:11.509000+00:00
Processing granule T15RYN 2024-06-15 16:31:19.154000+00:00
Processing granule T15RYP 2024-06-15 16:31:19.154000+00:00
Processing granule T16RBT 2024-06-15 16:31:19.154000+00:00
Processing granule T16RBU 2024-06-15 16:31:19.154000+00:00
Processing granule T15RYN 2024-06-23 16:31:21.277000+00:00
Processing granule T15RYP 2024-06-23 16:31:21.277000+00:00
Processing granule T16RBT 2024-06-23 16:31:21.277000+00:00
Processing granule T16RBU 2024-06-23 16:31:21.277000+00:00
Processing granule T15RYN 2024-07-01 16:31:17.338000+00:00
Processing granule T15RYP 2024-07-01 16:31:17.338000+00:00
Processing granule T16RBT 2024-07-01 16:31:17.338000+00:00
Processing granule T16RBU 2024-07-01 16:31:17.338000+00:00
Processing granule T15RYN 2024-07-09 16:31:29.187000+00:00
Processing granule T15RYP 2024-07-09 16:31:29.187000+00:00
Processing granule T16RBT 2024-07-09 16:31:29.187000+00:00
Processing granule T16RBU 2024-07-09 16:31:29.187000+00:00
Processing granule T15RYN 2024-07-17 16:31:33.271000+00:00
Processing granule T15RYP 2024-07-17 16:31:33.271000+00:00
Processing granule T16RBT 2024-07-17 16:31:33.271000+00:00
Processing granule T16RBU 2024-07-17 16:31:33.271000+00:00
Processing granule T15RYN 2024-07-25 16:31:43.265000+00:00
Processing granule T15RYP 2024-07-25 16:31:43.265000+00:00
Processing granule T16RBT 2024-07-25 16:31:43.265000+00:00
Processing granule T16RBU 2024-07-25 16:31:43.265000+00:00
Processing granule T15RYN 2024-08-02 16:31:36.413000+00:00
Processing granule T15RYP 2024-08-02 16:31:36.413000+00:00
Processing granule T16RBT 2024-08-02 16:31:36.413000+00:00
Processing granule T16RBU 2024-08-02 16:31:36.413000+00:00
Processing granule T15RYN 2024-08-10 16:31:42.335000+00:00
Processing granule T15RYP 2024-08-10 16:31:42.335000+00:00
Processing granule T16RBT 2024-08-10 16:31:42.335000+00:00
Processing granule T16RBU 2024-08-10 16:31:42.335000+00:00
Processing granule T15RYN 2024-08-18 16:31:46.256000+00:00
Processing granule T15RYP 2024-08-18 16:31:46.256000+00:00
Processing granule T16RBT 2024-08-18 16:31:46.256000+00:00
Processing granule T16RBU 2024-08-18 16:31:46.256000+00:00
Processing granule T15RYN 2024-08-26 16:31:51.172000+00:00
Processing granule T15RYP 2024-08-26 16:31:51.172000+00:00
Processing granule T16RBT 2024-08-26 16:31:51.172000+00:00
Processing granule T16RBU 2024-08-26 16:31:51.172000+00:00
Merge and Composite Data
You will notice for this watershed that:

1. The raster data for each date are spread across 4 granules
2. Any given image is incomplete because of clouds

- For each band:
  - For each date:
    - Merge all 4 granules
    - Mask any negative values created by interpolating from the nodata value of -9999 (rioxarray should account for this, but doesn’t appear to when merging. If you leave these values in they will create problems down the line)
  - Concatenate the merged DataArrays along a new date dimension
  - Take the mean in the date dimension to create a composite image that fills cloud gaps
  - Add the band as a dimension, and give the DataArray a name
- Concatenate along the band dimension
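The compositing step can be previewed with plain numpy: stack two cloud-gapped "dates" of the same band and average over the date axis, ignoring NaN, so each pixel is filled from whichever dates observed it. The values here are toy numbers, assuming NaN marks masked clouds (the solution below happens to use the median, which fills gaps the same way):

```python
import numpy as np

# Two toy "dates" of the same band, with NaN where clouds were masked
date_1 = np.array([[0.1, np.nan],
                   [0.3, 0.4]])
date_2 = np.array([[np.nan, 0.2],
                   [0.5, np.nan]])

# Concatenate along a new date axis, then average over it, ignoring NaN
stacked = np.stack([date_1, date_2])
composite = np.nanmean(stacked, axis=0)
```

Every pixel that was observed on at least one date gets a value: the corners come from a single date each, while the pixels seen twice are averaged.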
# Merge and composite an image for each band
# Merge granules for each date
# Mask negative values
# Composite images across dates

See our solution!
```python
@cached('delta_reflectance_da')
def merge_and_composite_arrays(granule_da_df):
    # Merge and composite an image for each band
    da_list = []
    for band, band_df in tqdm(granule_da_df.groupby('band')):
        merged_das = []
        for datetime, date_df in tqdm(band_df.groupby('datetime')):
            # Merge granules for each date
            merged_da = rxrmerge.merge_arrays(list(date_df.da))
            # Mask negative values
            merged_da = merged_da.where(merged_da > 0)
            merged_das.append(merged_da)
        # Composite images across dates
        composite_da = xr.concat(merged_das, dim='datetime').median('datetime')
        composite_da['band'] = int(band[1:])
        composite_da.name = 'reflectance'
        da_list.append(composite_da)
    return xr.concat(da_list, dim='band')

reflectance_da = merge_and_composite_arrays(reflectance_da_df)
reflectance_da
```

<xarray.DataArray 'reflectance' (band: 10, y: 556, x: 624)> Size: 14MB
array([[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
...
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]],
[[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]]], dtype=float32)
Coordinates:
* x (x) float64 5kB 7.926e+05 7.926e+05 ... 8.112e+05 8.113e+05
* y (y) float64 4kB 3.304e+06 3.304e+06 ... 3.287e+06 3.287e+06
spatial_ref int64 8B 0
  * band         (band) int64 80B 1 2 3 4 5 6 7 9 10 11

STEP 4: K-MEANS
Cluster your data by spectral signature using the k-means algorithm.
- Convert your DataArray into a tidy DataFrame of reflectance values (hint: check out the .to_dataframe() and .unstack() methods)
- Filter out all rows with no data (all 0s or any N/A values)
- Fit a k-means model. You can experiment with the number of groups to find what works best.
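The reshape in the first bullet can be mimicked with plain pandas: a long-format Series indexed by (y, x, band), like the one .to_dataframe() produces from the DataArray, unstacks into one column per band. The coordinates and values below are toy numbers, not real reflectance:

```python
import numpy as np
import pandas as pd

# Toy long-format reflectance: one row per (y, x, band) combination
index = pd.MultiIndex.from_product(
    [[0, 1], [0, 1], [2, 3]], names=['y', 'x', 'band'])
reflectance = pd.Series(
    [0.1, 0.5, 0.2, 0.6, np.nan, np.nan, 0.3, 0.7],
    index=index, name='reflectance')

# Pivot the band level into columns: one row per pixel, one column per band
model_df = reflectance.unstack('band')

# Drop pixels with any missing band (e.g. cloud gaps or image edges)
model_df = model_df.dropna()
```

The result has one row per pixel and one feature column per band, which is exactly the shape scikit-learn expects for clustering; the all-NaN pixel at (1, 0) is dropped.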
# Convert spectral DataArray to a tidy DataFrame
# Running the fit and predict functions at the same time.
# We can do this since we don't have target data.
# Add the predicted values back to the model DataFrame

See our solution!
```python
# Convert spectral DataArray to a tidy DataFrame
model_df = reflectance_da.to_dataframe().reflectance.unstack('band')
model_df = model_df.drop(columns=[10, 11]).dropna()
# Running the fit and predict functions at the same time.
# We can do this since we don't have target data.
prediction = KMeans(n_clusters=6).fit_predict(model_df.values)
# Add the predicted values back to the model DataFrame
model_df['clusters'] = prediction
model_df
```

| band | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 9 | clusters | |
|---|---|---|---|---|---|---|---|---|---|---|
| y | x | |||||||||
| 3.303783e+06 | 810148.062907 | 0.01560 | 0.0225 | 0.0409 | 0.0366 | 0.0478 | 0.0281 | 0.0181 | 0.0006 | 2 |
| 810178.062907 | 0.01895 | 0.0256 | 0.0396 | 0.0413 | 0.0426 | 0.0284 | 0.0220 | 0.0006 | 2 | |
| 810208.062907 | 0.01915 | 0.0246 | 0.0387 | 0.0377 | 0.0384 | 0.0273 | 0.0236 | 0.0007 | 2 | |
| 810238.062907 | 0.02040 | 0.0247 | 0.0440 | 0.0445 | 0.0629 | 0.0418 | 0.0266 | 0.0007 | 2 | |
| 810268.062907 | 0.01815 | 0.0245 | 0.0437 | 0.0444 | 0.0618 | 0.0397 | 0.0259 | 0.0008 | 2 | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3.287163e+06 | 793768.062907 | 0.02650 | 0.0345 | 0.0548 | 0.0427 | 0.0218 | 0.0098 | 0.0074 | 0.0007 | 2 |
| 793798.062907 | 0.02790 | 0.0351 | 0.0549 | 0.0439 | 0.0221 | 0.0104 | 0.0076 | 0.0008 | 2 | |
| 793828.062907 | 0.02580 | 0.0331 | 0.0534 | 0.0419 | 0.0194 | 0.0080 | 0.0059 | 0.0009 | 2 | |
| 793858.062907 | 0.02570 | 0.0326 | 0.0521 | 0.0402 | 0.0182 | 0.0064 | 0.0046 | 0.0007 | 2 | |
| 793888.062907 | 0.02550 | 0.0340 | 0.0541 | 0.0423 | 0.0199 | 0.0083 | 0.0060 | 0.0007 | 2 |
317917 rows × 9 columns
STEP 5: PLOT
Create a plot that shows the k-means clusters next to an RGB image of the area. You may need to brighten your RGB image by multiplying it by 10. The code for reshaping and plotting the clusters is provided for you below, but you will have to create the RGB plot yourself!
So, what is .sortby(['x', 'y']) doing for us? Try the code without it and find out.
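Brightening works by scaling reflectance up and saturating anything that overflows the display range. The clipping the solution performs with .where() can be sketched with numpy (toy reflectance values; real values are mostly well below 1):

```python
import numpy as np

# Toy reflectance values for one band
rgb = np.array([0.02, 0.10, 0.30])

# Scale to 0-255, brighten 10x, then saturate at 255 so bright pixels
# wrap neither around the uint8 range nor wash out the rest of the image
bright = np.clip(rgb * 255 * 10, 0, 255).astype(np.uint8)
```

Without the clip, multiplying first and casting to uint8 would make over-bright pixels wrap around to small values, producing speckled artifacts in the RGB image.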
```python
# Plot the k-means clusters
(
    rgb_plot
    +
    model_df.clusters.to_xarray().sortby(['x', 'y']).hvplot(
        cmap="Colorblind", aspect='equal')
)
```

See our solution!
```python
rgb = reflectance_da.sel(band=[4, 3, 2])
# Note: comparing to np.nan with != is always True, so the original
# .where(rgb != np.nan) was a no-op; use .notnull() to keep valid pixels
rgb_uint8 = (rgb * 255).astype(np.uint8).where(rgb.notnull())
rgb_bright = rgb_uint8 * 10
rgb_sat = rgb_bright.where(rgb_bright < 255, 255)

(
    rgb_sat.hvplot.rgb(
        x='x', y='y', bands='band',
        data_aspect=1,
        xaxis=None, yaxis=None)
    +
    model_df.clusters.to_xarray().sortby(['x', 'y']).hvplot(
        cmap="Colorblind", aspect='equal')
)
```

Don’t forget to interpret your plot!
YOUR PLOT HEADLINE AND DESCRIPTION HERE